Abstract

An  atomic  snapshot  memory  is  a  shared  data  structure  allowing  concurrent 
processes  to  store  information  in  a  collection  of  shared  registers,  all  of  which 
may  be  read  in  a  single  atomic  scan  operation.  This  paper  presents  three 
wait-free implementations of atomic snapshot memory. Two constructions
implement wait-free single-writer atomic snapshot memory from wait-free
atomic single-writer, n-reader registers. A third construction implements a
wait-free n-writer atomic snapshot memory from n-writer, n-reader registers.
The first implementation uses unbounded (integer) fields in these registers,
while the other implementations use only bounded registers. All operations
require O(n^2) reads and writes to the component shared registers in the
worst case.


Keywords:  Distributed  systems,  shared  memory,  atomic  snapshots,  wait- 
free  algorithms,  read/write  atomic  registers,  serializability. 
1 Introduction


Obtaining an instantaneous global picture of a system, from partial observations
made over a period of time as the system state evolves, is a fundamental
problem in distributed and concurrent computing. Indeed, much
of the difficulty in proving correctness of concurrent programs is due to
the need to argue based on “inconsistent” views of shared memory, obtained
concurrently with other processes’ modifications. Verification of concurrent
algorithms is thus complicated by the need for a “non-interference”
step [Owi75, OG76]. By simplifying (or eliminating) the non-interference
step, atomic snapshot memories can greatly simplify the design and verification
of many concurrent algorithms. Examples include exclusion problems
[K78, L86c, DGS88], construction of atomic multi-writer multi-reader registers
[VA86, Blo87, PB87, S88, LTV89], concurrent time-stamp systems
[DS89], randomized consensus [A88, AH89, ADS89, A90] and wait-free implementation
of data structures [AH90].

This paper introduces a general formulation of atomic snapshot memory:
shared memory partitioned into words written (updated) by individual
processes, or instantaneously read (scanned) in its entirety. It presents three
wait-free implementations of atomic snapshot memories, constructed from
wait-free atomic registers. (In [A89a, A89b, An90], Anderson independently
introduces the same notion and presents bounded implementations. See
Section 6 for a discussion.) The first implementation uses unbounded (integer)
fields in these registers, and is particularly easy to understand. The
second implementation uses bounded registers. Its correctness proof follows
the ideas of the unbounded implementation. Both constructions implement
a single-writer snapshot memory, in which each word may be updated by
only one process, from single-writer, n-reader registers. The third algorithm
implements a multi-writer snapshot memory [A89b] from wait-free atomic
n-writer, n-reader registers, again echoing key ideas from the earlier constructions.
Each update or scan operation requires O(n^2) reads and writes
to the relevant embedded atomic registers, in the worst case.

A related data structure, multiple assignment, allows processes to atomically
update nontrivial and intersecting subsets of the memory words, and
to read one location at a time. However, multiple assignment has no wait-free
implementation from read/write registers [H88]. The fact that wait-free
atomic snapshot memories can be implemented from wait-free atomic registers
stands in contrast to the impossibility results in [H88].

Section 2 of this paper defines single-writer and multi-writer atomic
snapshot memories. Section 3 contains an implementation of single-writer
snapshot memories from unbounded single-writer multi-reader registers,
Section 4 presents an implementation of single-writer snapshot memories from
bounded single-writer registers, and Section 5 presents an implementation
of multi-writer snapshot memories from bounded multi-writer, multi-reader
registers. Section 6 concludes with a discussion of the results, related work
and directions for future research.

2  Atomic  Snapshot  Memories 

Consider  a  shared  memory  divided  into  words,  where  each  word  holds  a 
data  value.  In  the  single-writer  case,  there  is  one  word  for  each  process, 
which  only  it  writes  (in  its  entirety)  and  the  others  read.  In  the  multi-writer 
case,  any  of  the  words  may  be  read  or  written  by  any  of  the  processes.  An 
n-process atomic snapshot memory supports two types of operations, scan_i
and update_i, that are available to each process P_i. Executions of scans and
updates can each be considered to have occurred as primitive atomic events
between the beginning and end of the corresponding operation execution
interval, so that the “serialization sequence” of such atomic events satisfies
the natural semantics. That is, each scan operation returns a vector v
of values such that each v_k is the argument of the last update to word k
that  is  serialized  before  that  scan.  (This  variant  of  serializability  is  called 
“linearizability”  [HW87].)  This  intuition  is  made  precise  in  the  following 
subsection. 

Two further restrictions are imposed on implementations of atomic snapshot
memories. First, following e.g. [L86b, H88], any snapshot implementation
is required to be constructed with single-writer, multi-reader atomic
registers as the only shared objects. The single-writer algorithms in Sections
3 and 4 satisfy this restriction directly, and the multi-writer algorithm
in Section 5 satisfies this restriction when the embedded multi-writer registers
are in turn implemented with one of the previously known constructions
from single-writer registers, e.g., [PB87, LTV89].

The second restriction imposed on snapshot memory implementations is
that they satisfy the property of wait-freedom [L86a, P83]. That is, every
snapshot operation by process P_i will terminate in a bounded number of
atomic steps of P_i, regardless of the behavior of other processes, assuming
only that local steps of P_i and operations on embedded shared objects terminate
in bounded time. (The reader is referred to [L86a, H88, AG88] for
discussions and proposed definitions of wait-freedom.)

2.1 Formal Specification of Single-Writer Snapshot Memories

Following [LT87, H88], a single-writer atomic snapshot memory for n processes
and a particular value set Value is an automaton with two types
of input Request actions: UpdateRequest_i(v) and ScanRequest_i, and two
types of output Return actions: UpdateReturn_i and ScanReturn_i(v_1, ..., v_n),
for any i ∈ {1..n}, and for all v, v_1, ..., v_n ∈ Value. These actions are called
the interface snapshot actions.

The formal specification of single-writer snapshot memory is based on a
particular automaton, SWS. In addition to the interface snapshot actions,
SWS has two types of internal actions, Update_i(v) and Scan_i(v_1, ..., v_n), for
any i ∈ {1..n} and for all v, v_1, ..., v_n ∈ Value. The states of SWS contain
an n-entry array Mem of type Value and n interface variables H_i. The
interface variables may hold as value any of the interface snapshot actions,
or a special value ⊥.

Process P_i interacts with SWS by issuing a request (an UpdateRequest_i(v)
or ScanRequest_i action). The result is to store the input action in the variable
H_i, enabling the appropriate internal action (Update_i(v) or Scan_i(v_1, ..., v_n)).
The internal action in turn assigns an appropriate output action to H_i,
and in the case of Update_i(v), assigns v to Mem_i as well. The change to
the interface value H_i enables the appropriate output (UpdateReturn_i or
ScanReturn_i(v_1, ..., v_n) action). Initially, each H_i = ⊥ and Mem_i = v_init ∈
Value.

The  steps  of  SWS  appear  in  Figure  1,  with  the  convention  that  actions 
without  preconditions  are  always  enabled  (e.g.,  input  actions),  and  that  state 
components  not  explicitly  described  in  the  effect  of  an  action  are  presumed 
to  retain  their  old  value.  Note  that,  while  requests  and  returns  by  different 
processes  may  be  interleaved,  these  actions  only  alter  the  interface  variables 
for  the  associated  processes.  The  “real”  work  is  done  by  the  atomic  internal 
actions,  formalizing  the  intuition  that  operations  of  atomic  memories  can 
be  assumed  to  have  occurred  at  some  instant  between  the  invocation  and 
response. Accordingly, an operation of SWS in a run α is said to be serialized at
the point of its associated Update or Scan operation.

The well-formed behaviors of SWS are those in which the environment
never issues two Request_i inputs without waiting for an intervening, matching
Return_i output. An automaton A implements a single-writer atomic
UpdateRequest_i(v)
    Effect:        H_i := UpdateRequest_i(v)

Update_i(v)
    Precondition:  H_i = UpdateRequest_i(v)
    Effect:        Mem[i] := v
                   H_i := UpdateReturn_i

UpdateReturn_i
    Precondition:  H_i = UpdateReturn_i
    Effect:        H_i := ⊥

ScanRequest_i
    Effect:        H_i := ScanRequest_i

Scan_i(v_1, ..., v_n)
    Precondition:  H_i = ScanRequest_i
                   Mem = (v_1, ..., v_n)
    Effect:        H_i := ScanReturn_i(v_1, ..., v_n)

ScanReturn_i(v_1, ..., v_n)
    Precondition:  H_i = ScanReturn_i(v_1, ..., v_n)
    Effect:        H_i := ⊥

Figure 1: The SWS automaton.
snapshot memory provided A has the interface snapshot actions as its input
and output actions, and provided every well-formed behavior of A is also a
behavior of SWS.^1
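The SWS automaton can be read as a small state machine. The following is a minimal sequential sketch in Python, not part of the original specification: the class and method names (`SWS`, `update_request`, `return_action`, and so on) are hypothetical, processes are numbered from 0, and `None` stands in for the special value ⊥. Each method corresponds to one action, and each `assert` encodes that action's precondition.

```python
class SWS:
    """Sequential model of the SWS automaton (processes are 0-indexed here)."""

    def __init__(self, n, v_init=None):
        self.mem = [v_init] * n   # Mem: one word per process
        self.h = [None] * n       # interface variables H_i; None stands for ⊥

    # Input (request) actions: no precondition, always enabled.
    def update_request(self, i, v):
        self.h[i] = ("UpdateRequest", v)

    def scan_request(self, i):
        self.h[i] = ("ScanRequest",)

    # Internal actions: the instants at which operations are serialized.
    def update(self, i):
        assert self.h[i] is not None and self.h[i][0] == "UpdateRequest"
        self.mem[i] = self.h[i][1]
        self.h[i] = ("UpdateReturn",)

    def scan(self, i):
        assert self.h[i] == ("ScanRequest",)
        self.h[i] = ("ScanReturn", tuple(self.mem))

    # Output (return) actions: enabled once the internal action has fired.
    def return_action(self, i):
        assert self.h[i] is not None and self.h[i][0] in ("UpdateReturn", "ScanReturn")
        out, self.h[i] = self.h[i], None
        return out


sws = SWS(n=2, v_init=0)
sws.update_request(0, 7)   # P_0 requests an update
sws.scan_request(1)        # P_1 requests a scan while the update is pending
sws.update(0)              # the update is serialized first ...
sws.scan(1)                # ... so the scan must observe it
result = sws.return_action(1)
```

In this run the requests of the two processes overlap, but the scan's internal action fires after the update's, so the scan returns the updated word.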

2.2 A Specification of Multi-Writer Snapshot Memories

Multi-writer snapshot memories are straightforward generalizations of single-writer
snapshot memories, and can be specified analogously. Specifically, a
multi-writer snapshot memory for n processes, a particular value set Value
and m memory elements is an automaton with input actions UpdateRequest_i(k, v) and
ScanRequest_i, and output actions UpdateReturn_i and ScanReturn_i(v_1, ..., v_m),
for all i ∈ {1..n}, k ∈ {1,...,m}, and v, v_1, ..., v_m ∈ Value.

Straightforward  modifications  of  the  automaton  SWS  of  Figure  1  are 
used  to  constrain  implementations  of  multi-writer  snapshot  memories,  just 
as SWS constrained single-writer snapshot memories. (The details are left
to  the  reader.) 

3 The Unbounded Single-Writer Algorithm

The  algorithm  is  based  on  two  observations: 

Observation  1:  Suppose  every  update  leaves  a  unique,  indelible  mark 
whenever  it  writes  to  the  memory.  If  two  sequential  reads  of  the  entire 
memory return identical values, where one read started after the first completed,
then the values returned constitute a snapshot [PB87].

This observation alone supports a simple unbounded algorithm, although
one which is not wait-free. The kth update by process P_i simply writes
the update value v and a sequence number k to a shared register in a single
atomic write. Scanners repeatedly collect the values of all n registers, until
two such collect operations return identical values. By Observation 1, such
a successful double collect is a snapshot.
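The simple algorithm just described can be sketched in a few lines of Python. This is an illustrative single-threaded model under stated assumptions, not the paper's code: the class name `DoubleCollectMemory` and the `interfere` hook are hypothetical, and the hook stands in for a concurrent updater by running between the two collects of each attempt.

```python
class DoubleCollectMemory:
    """Double-collect scan over (value, seq) registers; not wait-free."""

    def __init__(self, n, v_init=None):
        self.regs = [(v_init, 0)] * n     # one (value, sequence number) register per process

    def update(self, i, v):
        _, seq = self.regs[i]
        self.regs[i] = (v, seq + 1)       # value and sequence number in one atomic write

    def collect(self):
        return [self.regs[j] for j in range(len(self.regs))]

    def scan(self, interfere=None):
        attempts = 0
        while True:                        # may loop forever if updates never stop
            a = self.collect()
            if interfere is not None:
                interfere(attempts)        # hypothetical hook: simulated concurrent updates
            b = self.collect()
            attempts += 1
            if a == b:                     # successful double collect
                return [value for value, _ in b], attempts


mem = DoubleCollectMemory(3, v_init=0)
mem.update(1, 5)

def two_interfering_updates(attempt):
    # Process 0 updates during the first two double collects, then stops.
    if attempt < 2:
        mem.update(0, attempt + 1)

values, attempts = mem.scan(two_interfering_updates)
```

Here the scan needs three double collects because the first two are disturbed. If the hook never stopped updating, the loop would never exit, which is exactly why this version is not wait-free.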

Because updates may occur between every two successive collect operations,
this algorithm is not wait-free. However, the scanner may attribute
every  unsuccessful  double  collect  to  a  particular  updating  process,  whose 
sequence  number  was  observed  to  change.  Thus: 

^1 Alternative approaches to specifying concurrent objects are via their serial
specification [HW87] or as a set of axioms (cf. [L86a, M86]). Axiomatic specifications
for snapshot memories appear in [A89a, A89b, ADS89].
Observation 2: If a scan sees another process move (complete an update)
twice, that process executed a complete update operation within the interval
of the scan.

Suppose  every  update  performs  a  scan  and  writes  the  snapshot  value 
atomically  with  the  value  and  sequence  number.  Now  a  scanner  who  sees 
two  updates  by  the  same  process  can  borrow  the  snapshot  value  written  by 
the  second  update. 

A straightforward implementation uses the following shared data structures.
(See Figure 2.) Each process P_i has a single-writer, n-reader atomic
register r_i, that P_i writes and all processes read. The register has three
fields, value(r_i) (of type Value), seq(r_i) (of type integer) and view(r_i) (a
vector of n Values). The value and view fields are initialized to v_init and
the seq fields are initialized to 0.

The value of seq_i is stored (locally) across invocations of update_i. In
addition, each scan operation has a local vector moved, in which it records,
for each other process, whether it has performed an update operation that
overlapped the scan operation. The collect operation by any process i reads
each register r_j, j ∈ {1..n}, in an arbitrary order, returning a vector of
records read, indexed by process id.
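The scan and update procedures of Figure 2 translate fairly directly into Python. The sketch below is a single-threaded model under assumptions that are not in the paper: the class name `UnboundedSnapshot` is hypothetical, processes are 0-indexed, the `between_collects` hook simulates a concurrent process acting between the two collects, and the updater reads its current sequence number back from its own register instead of keeping a separate local copy.

```python
class UnboundedSnapshot:
    """Single-threaded model of the unbounded single-writer algorithm."""

    def __init__(self, n, v_init=None, between_collects=None):
        self.n = n
        # r[i] = (value, seq, view), written only by process i.
        self.r = [(v_init, 0, (v_init,) * n) for _ in range(n)]
        self.between_collects = between_collects   # hypothetical interference hook

    def collect(self):
        return [self.r[j] for j in range(self.n)]

    def scan(self, i):
        moved = [0] * self.n
        while True:
            a = self.collect()
            if self.between_collects is not None:
                self.between_collects(self, i)
            b = self.collect()
            if all(a[j][1] == b[j][1] for j in range(self.n)):
                return tuple(b[j][0] for j in range(self.n))   # nobody moved
            for j in range(self.n):
                if a[j][1] != b[j][1]:        # P_j moved
                    if moved[j] == 1:         # P_j moved once before: borrow its view
                        return b[j][2]
                    moved[j] += 1

    def update(self, i, value):
        s = self.scan(i)                      # embedded scan
        _, seq, _ = self.r[i]
        self.r[i] = (value, seq + 1, s)       # one atomic write of all three fields


pending = [10, 20]

def p1_updates_during_p0_scan(mem, scanner):
    # Process 1 completes one update inside each of P_0's first two double collects.
    if scanner == 0 and pending:
        mem.update(1, pending.pop(0))

mem = UnboundedSnapshot(2, v_init=0, between_collects=p1_updates_during_p0_scan)
borrowed = mem.scan(0)   # sees P_1 move twice and borrows the view of its second update
clean = mem.scan(0)      # no interference: a successful double collect
```

In this run the first scan never gets a clean double collect. It returns the view written by process 1's second update, which was taken by an embedded scan that ran entirely inside the outer scan's interval.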

3.1 Correctness Proof

The proof strategy is to construct an explicit serialization. That is, given an
infinite or finite run of the system, calls and returns from the update_i procedures
are identified with the UpdateRequest_i and UpdateReturn_i actions,
and calls and returns from scan_i procedures (unless called from within updates)
are identified with the ScanRequest_i and ScanReturn_i actions. The
scan and update operations themselves consist of sequences of more primitive
operations that are either reads and writes of atomic registers, or manipulations
of local data. The former are atomic by assumption; the latter
are trivially atomic. Hence, an arbitrary run of an n-process system can be
considered to be a (possibly infinite) sequence of interface snapshot actions,
and atomic reads, writes or local data manipulations. Given this sequence,
Scan_i and Update_i actions are added so that the resulting sequence, projected
on the actions of SWS, is a schedule of that automaton. Hence, the
algorithm is atomic.

Consider then any sequence α = π_1 π_2 ..., where each π_j is either an action
of SWS, a read read_i(r_j = v) by P_i of atomic register r_j returning v, a
write write_i(r_i = v) by P_i of v to r_i, or a local computation event. Denote
procedure scan_i
begin
0:      for j = 1 to n do moved_j := 0 od;
1:      a := collect;                /* (value, seq, view) triples */
2:      b := collect;                /* (value, seq, view) triples */
3:      if (∀j ∈ {1..n}) (seq(a_j) = seq(b_j)) then
4:          return (value(b_1), ..., value(b_n));    /* Nobody moved. */
5:      for j = 1 to n do
6:          if seq(a_j) ≠ seq(b_j) then              /* P_j moved */
7:              if moved_j = 1 then                  /* P_j moved once before! */
8:                  return (view(b_j));
9:              else moved_j := moved_j + 1;
        od;
10:     goto line 1;
end scan_i;

procedure update_i(value)
begin
1:      s := scan_i;                 /* Embedded scan. */
2:      r_i := (value, seq_i + 1, s);
end update_i;

Figure 2: The unbounded single-writer algorithm.
by α_k the k-length prefix of α. Although the internal states of the atomic
register implementations are not known, for any such finite prefix α_k of α
it is natural to define the state of the shared memory after α_k, or state(α_k),
to be the vector (a_1, ..., a_n), where a_i is the value of the last write by process
P_i in α_k, or the initial value if P_i has not yet written. If state(α_k) =
(a_1, ..., a_n), then snapshot(α_k) denotes (value(a_1), ..., value(a_n)). The sequence
snapshot(α_0), snapshot(α_1), snapshot(α_2), ... serves as the basis for
the serialization of α.
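The definitions of state(α_k) and snapshot(α_k) amount to replaying the writes in a prefix. The following Python sketch illustrates this under an assumed event encoding that is not in the paper: a run is a list of tuples, a write by process i is encoded as `("write", i, record)` with `record = (value, seq, view)`, and every other event is ignored.

```python
def state(events, k, n, initial):
    """Registers after the k-length prefix: last record written by each process."""
    regs = [initial] * n
    for event in events[:k]:
        if event[0] == "write":
            _, i, record = event
            regs[i] = record
    return tuple(regs)


def snapshot(events, k, n, initial):
    """Value fields of state(alpha_k)."""
    return tuple(record[0] for record in state(events, k, n, initial))


initial = (0, 0, None)                       # (value, seq, view) before any write
run = [
    ("write", 0, (5, 1, None)),              # P_0 writes value 5
    ("read", 1, 0),                          # a read; does not change the state
    ("write", 1, (7, 1, None)),              # P_1 writes value 7
]
snapshots = [snapshot(run, k, 2, initial) for k in range(len(run) + 1)]
```

The list `snapshots` is the sequence snapshot(α_0), snapshot(α_1), ... for this short run; a scan is correctly serialized if the vector it returns appears in this sequence at some index inside the scan's interval.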

The update operations are serialized at the same point in the run as their
embedded writes. A scan_i operation has a successful double collect when
the test in line 3 is passed; following the two collects a := collect in line 1
and b := collect in line 2, the sequence numbers in a and b are identical.
Scans with successful double collects are serialized between the end of the
collect in line 1 and the beginning of the second collect in line 2. Lemma 3.1
proves that the values returned by such a scan constitute a snapshot during
this interval.

Lemma 3.1 Let α = π_1 π_2 ... be a run of the unbounded algorithm in which
a particular scan_i operation has a successful double collect: a := collect in
line 1 and b := collect in line 2. Let π_u and π_w be the last read of the first
collect and the first read of the second collect, respectively. Then for every
prefix α_v of α, u < v < w, snapshot(α_v) = (value(b_1), ..., value(b_n)).

Proof: By contradiction. That is, suppose that two successive reads by P_i
of r_j in lines 1 and 2 return the same sequence number, and that an update
by P_j is serialized between the two reads. Since the update is serialized
with its embedded write, a write by P_j to r_j also occurs between the two
reads. Furthermore, the sequence number in the second read must be strictly
greater than the sequence number in the first read, a contradiction. The
lemma follows. ■

The  remaining  scans  return  when  they  observe  an  updater  move  twice: 
they  will  be  serialized  in  the  same  interval  as  the  embedded  scan.  The  next 
lemma  guarantees  that  this  interval  is  contained  in  the  interval  of  the  scan. 

Lemma 3.2 Let α = π_1 π_2 ... be a run of the unbounded algorithm in which a
particular scan_i operation observes changes in process P_j’s sequence number
field during two different double collects. Then the value of r_j read during
the last collect was written by a scan_j operation that began after the first of
the four collects.
These two lemmas imply that all scans are correctly serialized somewhere
in their intervals.

Lemma 3.3 Let α = π_1 π_2 ... be a run of the unbounded algorithm in which a
particular scan_i operation beginning in event π_u returns (v_1, ..., v_n) in event
π_w. Then snapshot(α_v) = (v_1, ..., v_n) for some v, u < v < w.

By the pigeon-hole principle, in n + 1 double collects one must be successful
or some updater must be observed moving twice. Hence scans are
wait-free. This in turn implies that updates are wait-free.

Lemma 3.4 Every scan or update operation by process P_i returns after
O(n^2) atomic steps of P_i, ∀i ∈ {1..n}.
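The quadratic bound can be checked with a short count. The sketch below is an illustration rather than part of the proof: the function names are hypothetical, only shared-register reads and writes are counted (local steps are ignored), and the figure of n + 1 double collects is the pigeon-hole bound stated above.

```python
def scan_register_accesses(n):
    """Upper bound on register reads in one scan of the unbounded algorithm."""
    double_collects = n + 1        # pigeon-hole bound on attempts
    reads_per_collect = n          # a collect reads each of the n registers once
    return double_collects * 2 * reads_per_collect


def update_register_accesses(n):
    """An update performs one embedded scan followed by a single write."""
    return scan_register_accesses(n) + 1


bounds = {n: (scan_register_accesses(n), update_register_accesses(n)) for n in (1, 4, 10)}
```

Both counts grow as 2n(n + 1), up to the single extra write of an update, which is O(n^2).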

This  discussion  is  summarized  in  the  following  theorem. 

Theorem  3.5  The  unbounded  algorithm  implements  a  wait-free  single-writer 
snapshot  memory. 

4 The Bounded Single-Writer Algorithm

The  sequence  numbers  in  the  unbounded  algorithm  enable  scan  operations 
to  detect  changes  to  the  memory  due  to  concurrent  updates.  To  achieve  the 
same effect with bounded registers, each scanner/updater pair of processes
communicates  via  two  atomic  bits,  each  written  by  one  and  read  by  the 
other.  Before  performing  a  double  collect,  a  scan  operation  sets  its  bit  equal 
to  the  value  read  in  the  other  bit.  If  after  the  double  collect,  the  bits  are 
observed  by  the  scanner  to  be  not  equal,  then  the  updater  changed  its  bit 
(moved)  after  the  scanner’s  first  read  of  that  bit. 

Specifically, the bounded single-writer algorithm of Figure 3 replaces
the unbounded sequence number field of r_i with n pairs of handshake bits
[P83, L86b]. That is, for each process pair (P_i, P_j) the register r_i contains
the bit field p_{i,j}. Additional atomic single-writer single-reader bits q_{i,j} are
written by P_i and read by P_j. The q_{i,j} bits are written when P_i scans (to the
values read from the p_{j,i} bits), and the p_{i,j} bits are written when P_i updates
(to the negations of the values read from the q_{j,i} bits). An additional toggle
bit, toggle(r_i), is changed during every update, to ensure that each write
operation changes the register value.
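The handshake between one scanner and one updater can be isolated in a few lines. This Python sketch is a simplification under stated assumptions: the class `Handshake` and its method names are hypothetical, it models a single (P_i, P_j) pair, and each method runs atomically, so it does not show the interleaving (an updater reading q before the scanner rewrites it) that the toggle bit is there to handle.

```python
class Handshake:
    """Handshake bits for one scanner P_i and one updater P_j."""

    def __init__(self):
        self.q = False   # q_{i,j}: written by the scanner, read by the updater
        self.p = False   # p_{j,i}: written by the updater, read by the scanner

    def scanner_handshake(self):
        self.q = self.p            # scanner sets its bit equal to the updater's bit

    def updater_move(self):
        self.p = not self.q        # updater sets its bit to the negation of the scanner's bit

    def scanner_sees_move(self):
        return self.p != self.q    # unequal bits: the updater moved since the handshake


hs = Handshake()
hs.scanner_handshake()
before = hs.scanner_sees_move()      # no update yet
hs.updater_move()
after_one = hs.scanner_sees_move()   # one update since the handshake
hs.updater_move()
after_two = hs.scanner_sees_move()   # still detected: p stays the negation of q
hs.scanner_handshake()
after_reset = hs.scanner_sees_move() # a fresh handshake clears the signal
```

Unlike a sequence number, the bits cannot say how many updates occurred, only that at least one did; the moved counters of the scan supply the count across successive double collects.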
procedure scan_i
begin
0:      for j = 1 to n do moved_j := 0 od;
0.5:    for j = 1 to n do q_{i,j} := p_{j,i}(r_j) od;     /* Handshake. */
1:      a := collect;      /* (value, bit vector, bit, view) tuples */
2:      b := collect;      /* (value, bit vector, bit, view) tuples */
3:      if (∀j ∈ {1..n}) (p_{j,i}(a_j) = p_{j,i}(b_j) = q_{i,j}
            and toggle(a_j) = toggle(b_j)) then            /* Nobody moved. */
4:          return (value(b_1), ..., value(b_n));
5:      else for j = 1 to n do
6:          if p_{j,i}(a_j) ≠ q_{i,j} or p_{j,i}(b_j) ≠ q_{i,j}     /* P_j moved */
                or toggle(a_j) ≠ toggle(b_j) then
7:              if moved_j = 1 then                /* P_j moved once before! */
8:                  return (view(b_j));
9:              else moved_j := moved_j + 1;
        od;
10:     goto line 0.5;
end scan_i;

procedure update_i(value)
begin
0:      for j = 1 to n do f_j := ¬q_{j,i} od;      /* Collect handshake values. */
1:      s := scan_i;                               /* Embedded scan. */
2:      r_i := (value, f, ¬toggle(r_i), s);
end update_i;

Figure 3: The bounded single-writer algorithm.
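Figure 3 can be modeled in Python in the same way as the unbounded algorithm. The sketch below is a single-threaded illustration under assumptions not found in the paper: the class name `BoundedSnapshot` is hypothetical, processes are 0-indexed, the handshake bits q_{i,j} are kept in a nested list `q[i][j]`, and the `between_collects` hook simulates a concurrent process acting between the two collects.

```python
class BoundedSnapshot:
    """Single-threaded model of the bounded single-writer algorithm."""

    def __init__(self, n, v_init=None, between_collects=None):
        self.n = n
        # r[i] = (value, p, toggle, view); p[j] is the handshake bit p_{i,j}.
        self.r = [(v_init, (False,) * n, False, (v_init,) * n) for _ in range(n)]
        self.q = [[False] * n for _ in range(n)]    # q[i][j] = q_{i,j}, written by P_i
        self.between_collects = between_collects    # hypothetical interference hook

    def collect(self):
        return [self.r[j] for j in range(self.n)]

    def scan(self, i):
        n = self.n
        moved = [0] * n
        while True:
            for j in range(n):                      # line 0.5: handshake
                self.q[i][j] = self.r[j][1][i]
            a = self.collect()
            if self.between_collects is not None:
                self.between_collects(self, i)
            b = self.collect()
            movers = [
                j for j in range(n)
                if a[j][1][i] != self.q[i][j]       # handshake bit differs in first collect
                or b[j][1][i] != self.q[i][j]       # ... or in second collect
                or a[j][2] != b[j][2]               # ... or the toggle bit changed
            ]
            if not movers:
                return tuple(b[j][0] for j in range(n))
            for j in movers:
                if moved[j] == 1:                   # P_j moved once before: borrow its view
                    return b[j][3]
                moved[j] += 1

    def update(self, i, value):
        f = tuple(not self.q[j][i] for j in range(self.n))   # line 0: negated handshake bits
        s = self.scan(i)                                     # embedded scan
        self.r[i] = (value, f, not self.r[i][2], s)          # single atomic write


pending = [10, 20]

def p1_updates_during_p0_scan(mem, scanner):
    if scanner == 0 and pending:
        mem.update(1, pending.pop(0))

mem = BoundedSnapshot(2, v_init=0, between_collects=p1_updates_during_p0_scan)
borrowed = mem.scan(0)   # P_1 is seen moving twice; its second view is borrowed
clean = mem.scan(0)      # undisturbed double collect
```

The run mirrors the unbounded case: with only bits in place of sequence numbers, the scanner still detects both moves of process 1 and returns the view of its second update.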
4.1 Correctness Proof


For this algorithm, a successful double collect is a pair a := collect, b :=
collect, with all handshake bits p_{j,i} = q_{i,j} and corresponding toggle bits in
a and b identical. The main issue that has to be argued is that the handshake
and toggle bits guarantee that a successful double collect produces a
snapshot. This is proven in the following lemma.

Lemma 4.1 Let α = π_1 π_2 ... be a run of the bounded algorithm in which a
particular scan_i operation has a successful double collect: a := collect in line
1 and b := collect in line 2. Let π_u and π_w be the last read in line 1 and the
first read of line 2, respectively. Then for every prefix α_v of α, u < v < w,
snapshot(α_v) = (value(b_1), ..., value(b_n)).

Proof: As in the proof of Lemma 3.1, the proof is by contradiction. That
is, suppose that two successive reads by P_i of r_j in a collect pair produce
values of p_{j,i}(r_j) that are equal to q_{i,j}’s most recently written value, and
identical toggle(r_j)’s. Assume that a write by P_j to r_j is serialized between
these two atomic read operations. Consider the last such write operation by
P_j; being last, it must write the same handshake bit b and toggle bit t read
by P_i. Since during an update P_j assigns to p_{j,i} the negation of the value
read in q_{i,j}, that read of q_{i,j} must have preceded P_i’s most recent write to
q_{i,j} of b. This implies the following sequence of events:

read_j(q_{i,j} = ¬b),                            /* update: handshake read */
write_i(q_{i,j} = b),                            /* scan: handshake write */
read_i(p_{j,i}(r_j) = b, toggle(r_j) = t),       /* scan: first collect */
write_j(p_{j,i}(r_j) = b, toggle(r_j) = t),      /* update: write */
read_i(p_{j,i}(r_j) = b, toggle(r_j) = t).       /* scan: second collect */

The first operation, the read by P_j, is a part of the same update as the
later write by P_j, which by assumption is the last write by P_j serialized
between the two reads by P_i. It follows that no other write operation by P_j
can be serialized between P_i’s two reads. Then the two reads by P_i of r_j
return values written by two successive writes by P_j, yet the toggle bits are
identical, a contradiction. (The first of these writes by P_j does not appear in
the sequence above: it is P_j’s most recent previous write, and must precede
the first event of the sequence, read_j(q_{i,j} = ¬b).) Hence, no write operation
by P_j can be serialized between P_i’s two reads, and the claim follows. ■
The serialization, remaining lemmas and theorem from the unbounded
algorithm  translate  directly  to  the  bounded  algorithm.  (It  is  important  that 
each  update  operation  changes  the  value,  handshake  and  toggle  fields  in  a 
single  atomic  write  operation.) 

Lemma 4.2 Let α = π_1 π_2 ... be a run of the bounded algorithm in which
a particular scan_i operation observes changes in process P_j’s handshake or
toggle bits during two different double collects. Then the value of r_j read
during the last collect was written by a scan_j operation that began after the
first of the four collects.

Lemma 4.3 Let α = π_1 π_2 ... be a run of the bounded algorithm in which a
particular scan_i operation beginning in event π_u returns (v_1, ..., v_n) in event
π_w. Then snapshot(α_v) = (v_1, ..., v_n) for some v, u < v < w.

Lemma 4.4 Every scan or update operation by process P_i returns after
O(n^2) atomic steps of P_i, ∀i ∈ {1..n}.

Theorem  4.5  The  bounded  algorithm  implements  a  wait-free  single-writer 
snapshot  memory. 

5 The Bounded Multi-Writer Algorithm

Because processes may now write to any memory location, the handshake
bits and view fields are uncoupled from the value fields. The latter are stored
in multi-writer, multi-reader registers r_k, where now the index k is a memory
address not related to process indices. To ensure that each successive write
to these registers has an observable effect, an id field and toggle bit field are
also included: successive update operations by P_i to word k write i in the
id(r_k) field and alternate values in the toggle field. (The id field also allows
a scan operation to attribute an observed change to a specific process.)

Because the handshake bits are not written atomically with the r_k registers,
a scan may observe changes by the same update operation twice: once
changing the handshake bits, and once changing the value of a memory
word. Hence, a scan operation must observe process P_j move three times
before the value in view_j can be borrowed.

Hence, the algorithm of Figure 4 requires a multi-writer multi-reader register
r_k for every memory address k ∈ {1,...,m}, holding fields value(r_k),
id(r_k) and toggle(r_k) of type Value, {1..n} and boolean, respectively. In addition, for
every process P_i there are 2n single-writer multi-reader boolean registers p_{i,j}
and q_{i,j}, j ∈ {1..n}, and a single-writer multi-reader register view_i, holding
a vector of m Values. The scan and update operations of a process i are
described in Figure 4.
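The data layout and the three-moves rule can be made concrete with a Python model of the multi-writer algorithm. This is a single-threaded sketch under several assumptions that are not in the paper: the class name `MultiWriterSnapshot` is hypothetical, processes and memory words are 0-indexed, the initial id field is `None`, and the `between_collects` hook simulates a concurrent process. The sketch also repeats the handshake step at the start of every attempt, as the bounded single-writer algorithm does; that choice is an assumption of this model.

```python
class MultiWriterSnapshot:
    """Single-threaded model of the bounded multi-writer algorithm."""

    def __init__(self, n, m, v_init=None, between_collects=None):
        self.n, self.m = n, m
        self.r = [(v_init, None, False)] * m            # r[k] = (value, id, toggle)
        self.p = [[False] * n for _ in range(n)]        # p[i][j] = p_{i,j}, written by P_i
        self.q = [[False] * n for _ in range(n)]        # q[i][j] = q_{i,j}, written by P_i
        self.view = [(v_init,) * m for _ in range(n)]   # view[i], written by P_i
        self.t = [[False] * m for _ in range(n)]        # t[i][k]: P_i's local toggle for word k
        self.between_collects = between_collects        # hypothetical interference hook

    def scan(self, i):
        n, m = self.n, self.m
        moved = [0] * n
        while True:
            for j in range(n):                          # handshake
                self.q[i][j] = self.p[j][i]
            a = list(self.r)                            # first collect of the r_k
            if self.between_collects is not None:
                self.between_collects(self, i)
            b = list(self.r)                            # second collect of the r_k
            h = [self.p[j][i] for j in range(n)]        # collect of the handshake bits
            movers = [
                j for j in range(n)
                if self.q[i][j] != h[j]
                or any(b[k][1] == j and (a[k][1] != b[k][1] or a[k][2] != b[k][2])
                       for k in range(m))
            ]
            if not movers:
                return tuple(b[k][0] for k in range(m))
            for j in movers:
                if moved[j] == 2:                       # P_j moved twice before: third move
                    return self.view[j]
                moved[j] += 1

    def update(self, i, k, value):
        for j in range(self.n):                         # handshake
            self.p[i][j] = not self.q[j][i]
        self.view[i] = self.scan(i)                     # embedded scan
        self.t[i][k] = not self.t[i][k]
        self.r[k] = (value, i, self.t[i][k])            # write to the multi-writer register


pending = [10, 20, 30]

def p1_updates_word0_during_p0_scan(mem, scanner):
    if scanner == 0 and pending:
        mem.update(1, 0, pending.pop(0))

mem = MultiWriterSnapshot(2, 2, v_init=0, between_collects=p1_updates_word0_during_p0_scan)
borrowed = mem.scan(0)   # P_1 is seen moving three times; view_1 is borrowed
clean = mem.scan(0)      # undisturbed attempt
```

In this run process 1 completes a whole update inside each of the scanner's first three attempts, so the third observed move lets the scanner return view_1, which holds the result of the embedded scan of process 1's third update.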

5.1  Correctness  Proof 

The serialization is defined as in the previous algorithms, with updates
serialized with the (atomic) writes to the value registers. For this algorithm,
a successful double collect occurs when the test in line 3 is passed. This
test depends on steps 0.5 through 2.5, recording the handshake bits and the
shared registers r_k twice: Step 0.5 implicitly collects the values of each p_{j,i},
by storing p_{j,i} in q_{i,j}. The next three lines explicitly record the values of
the r_k registers and the handshake bits in a, b and h, respectively. The test
is passed if the handshake bits and id, toggle fields of the registers contain
identical values in each pair of respective reads. Again, the main issue that
has to be argued is that a successful double collect produces a snapshot.

Lemma 5.1 Let α = π_1 π_2 ... be a run of the bounded multi-writer algorithm
in which a particular scan_i operation has a successful double collect, including
a := collect in line 1 and b := collect in line 2. Let π_u and π_w be the
last read of line 1 and the first read of line 2, respectively. Then for every
prefix α_v of α, u < v < w, snapshot(α_v) = (value(b_1), ..., value(b_m)).

Proof: As in the proofs of Lemmas 4.1 and 3.1, the proof is by contradiction.
Suppose then that two successive reads by P_i of r_k both produce the values
id(r_k) = j and toggle(r_k) = t, and the two reads of p_{j,i} also produce the
same value, c. Assume that an update to word k and hence a write to r_k is
serialized between the two atomic reads of r_k in lines 1 and 2. Consider the
last such write operation: because the second read by P_i returned id(r_k) = j,
this last write is by P_j. Since the first read by P_i also returned id(r_k) = j
and the same toggle value t, there must be another intervening write by P_j
to r_k, with toggle value ¬t, serialized between the two reads by P_i. It follows
that the last write by P_j is part of an update that began after P_i’s first read
of r_k. Within that update, p_{j,i} is set to ¬q_{i,j}. Henceforth, the value of p_{j,i}
cannot change until q_{i,j} does, so the last read by P_i of p_{j,i} recorded in h_j
must see it equal to ¬q_{i,j}, a contradiction. Hence, no writes can be serialized
between the two reads of r_k.

The  full  sequence  of  atomic  events  constructed  in  this  argument  is  as 
follows: 
procedure scan_i
begin
0:      for j = 1 to n do moved_j := 0 od;
0.5:    for j = 1 to n do q_{i,j} := p_{j,i} od;           /* Handshake. */
1:      a := collect(r_k : k ∈ {1,...,m});      /* (value, id, bit) triples */
2:      b := collect(r_k : k ∈ {1,...,m});      /* (value, id, bit) triples */
2.5:    h := collect(p_{j,i} : j ∈ {1..n});     /* handshake bits */
3:      if (∀j ∈ {1..n}) (q_{i,j} = h_j)
            and (∀k ∈ {1,...,m}) (id(a_k) = id(b_k))              /* Nobody moved. */
            and (∀k ∈ {1,...,m}) (toggle(a_k) = toggle(b_k)) then
4:          return (value(b_1), ..., value(b_m));
5:      else for j = 1 to n do
6:          if ((q_{i,j} ≠ h_j) or ((∃k, id(b_k) = j)               /* P_j moved */
                (id(a_k) ≠ id(b_k) or toggle(a_k) ≠ toggle(b_k)))) then
7:              if moved_j = 2 then             /* P_j moved twice before! */
8:                  return (view_j);
9:              else moved_j := moved_j + 1;
        od;
10:     goto line 1;
end scan_i;

procedure update_i(k, value)     /* Process P_i writes value to memory word k */
begin
0:      for j = 1 to n do p_{i,j} := ¬q_{j,i} od;     /* Handshake. */
1:      view_i := scan_i;       /* Embedded scan: view_i is a single-writer register */
1.5:    t_k := ¬t_k;            /* local variable t saved between calls */
2:      r_k := (value, i, t_k);     /* r_k is a multi-writer register */
end update_i;

Figure 4: The bounded multi-writer algorithm.

read_i(p_{j,i} = c),                       /* P_i's first handshake collect */
write_i(q_{i,j} = c),                      /* P_i's handshake write */
read_i(id(r_k) = j, toggle(r_k) = t),      /* P_i's first collect */
write_j(id(r_k) = j, toggle(r_k) = ¬t),    /* P_j's toggle bit write */
read_j(q_{i,j} = c),                       /* P_j's handshake read for second write */
write_j(p_{j,i} = ¬c),                     /* P_j's handshake write for second write */
write_j(id(r_k) = j, toggle(r_k) = t),     /* P_j's assumed write */
read_i(id(r_k) = j, toggle(r_k) = t),      /* P_i's second r_k collect */
read_i(p_{j,i} = c)                        /* second handshake collect */


It follows that a scanner with a successful double collect can conclude
that no writes are serialized between the last read in line 1 and the first
read in line 2. Hence, the values read are a snapshot, and the lemma
follows.  ■
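
The contradiction in this argument can be checked mechanically. The
following sketch replays the event sequence above on three registers and
reports what $P_i$ would actually read. It is an illustration under
simplifying assumptions: only the registers $p_{j,i}$, $q_{i,j}$ and $r_k$
are modelled, each as a plain Python value, and the name `replay` is
hypothetical.

```python
def replay(c: bool, t: bool, j: int = 1):
    """Replay the assumed interleaving; return what P_i's final tests see."""
    p_ji = c                      # handshake bit written by P_j, read by P_i
    r_k = (j, t)                  # (id, toggle) fields of the value register

    q_ij = p_ji                   # P_i, line 0.5: read p_{j,i} = c, write q_{i,j} = c
    first = r_k                   # P_i, line 1: first collect of r_k

    r_k = (j, not t)              # P_j: intervening write with toggle bit not-t

    p_ji = not q_ij               # P_j, update line 0: read q_{i,j}, write p_{j,i}
    r_k = (j, t)                  # P_j, update line 2: the assumed last write

    second = r_k                  # P_i, line 2: second collect of r_k
    h_j = p_ji                    # P_i, line 2.5: second handshake collect

    word_looks_unchanged = (first == second)
    handshake_unchanged = (h_j == q_ij)
    return word_looks_unchanged, handshake_unchanged
```

For every choice of `c` and `t` the function returns `(True, False)`: the
two reads of $r_k$ agree, but the final handshake read yields $\neg c$, so
the last event of the sequence, $read_i(p_{j,i} = c)$, cannot occur.
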

The previous lemma says that the scans with successful double collects
can be serialized correctly. It remains to argue that the scans which return
borrowed values use values from scans that run entirely within their
interval. As discussed, the crucial embedded scan lemma must make a
concession to the non-atomicity of writes to the handshake and value
registers.

Lemma 5.2  Let $\alpha = \pi_1 \pi_2 \ldots$ be a run of the bounded
multi-writer algorithm in which a particular $scan_i$ operation detects
changes in process $P_j$'s handshake bit or writes by $P_j$ to value
registers during three different double collects. Then the value of $view_j$
read by $P_i$ after the last collect was returned by a $scan_j$ operation
that began after the first of the six collects.

Proof: The proof of this lemma rests on the sequence of relevant atomic
write steps that $P_j$ makes in successive updates:

write to p_{j,i}
write to view_j
write to r_k
write to p_{j,i}
write to view_j
write to r_k

Observing any three changes, in the $p_{j,i}$ or value registers, means that
an intervening scan must have taken place and have been recorded in
$view_j$. Either this scan or a more recent scan by $P_j$ will be read by
$P_i$.  ■
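
The counting step in this proof can be checked by brute force on a
simplified model. The sketch below assumes that each update by $P_j$
contributes exactly the three writes listed above, that the embedded scan
of an update runs between its write to $p_{j,i}$ and its write to $view_j$,
and that every write to $p_{j,i}$ or to a value register may be one of the
three observed changes. The function names are hypothetical.

```python
from itertools import combinations


def write_steps(num_updates: int):
    """P_j's relevant atomic writes, in order, for successive updates."""
    return [(kind, u) for u in range(num_updates)
            for kind in ("p", "view", "r")]


def borrowed_view_is_fresh(num_updates: int) -> bool:
    """Check every way of observing three changes among P_j's writes."""
    steps = write_steps(num_updates)
    observable = [x for x, (kind, _) in enumerate(steps) if kind != "view"]
    for first, _, third in combinations(observable, 3):
        # Latest write to view_j that precedes the third observed change.
        views = [x for x in range(third) if steps[x][0] == "view"]
        if not views:
            return False
        u = steps[views[-1]][1]
        # The embedded scan of update u starts after its handshake write.
        scan_start = steps.index(("p", u))
        if scan_start < first:
            return False  # the scan would predate the first observed change
    return True
```

In this model the check succeeds for any number of updates: the view
available after the third observed change always comes from a scan that
started no earlier than the first observed change. The model ignores the
other handshake bits written in line 0 of an update and the exact timing of
the scanner's reads.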

These  two  lemmas  imply: 

Lemma 5.3  Let $\alpha = \pi_1 \pi_2 \ldots$ be a run of the bounded
multi-writer algorithm in which a particular $scan_i$ operation beginning in
event $\pi_u$ returns $(v_1, \ldots, v_m)$ in event $\pi_w$. Then
$snapshot(\alpha_v) = (v_1, \ldots, v_m)$ for some $v$, $u < v < w$.

As before, the pigeon-hole principle implies that in $2n + 1$ double collects
one must be successful or some updater must be observed moving three
times. Hence scans are wait-free. This in turn implies that updates are
wait-free.
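
The pigeon-hole bound can be illustrated with a small adversarial count. The
sketch below assumes that every double collect fails and that exactly one
process is observed moving in each (additional movers could only end the
scan sooner). It follows lines 7 to 9 of the scan procedure; the function
name is hypothetical.

```python
def worst_case_double_collects(n: int) -> int:
    """Number of double collects before a scan must return, in the worst case."""
    moved = [0] * n
    rounds = 0
    while True:
        rounds += 1
        # Adversary: blame the process with the fewest recorded moves.
        j = min(range(n), key=lambda x: moved[x])
        if moved[j] == 2:     # line 7: P_j moved twice before
            return rounds     # line 8: the scan returns view_j
        moved[j] += 1         # line 9
```

With `n` processes the adversary can spend at most two moves per process,
so the function returns `2 * n + 1`; for example, it returns 7 for `n = 3`.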

Theorem 5.4  The bounded multi-writer algorithm implements a wait-free
multi-writer snapshot memory.

6  Discussion  and  Directions  for  Further  Research 

The distributed snapshot of Chandy and Lamport [CL85] provides a simple
solution to the similar problem for message-passing systems. The distributed
snapshot algorithm has proven a useful tool in solving other distributed
problems (see, e.g., [G86, BT84]), and it is likely that snapshot memories
will play a similar role in concurrent programming.

Interestingly, distributed snapshots are not true instantaneous images of
the global state of the kind that scans of snapshot memories produce.
However, distributed snapshots are indistinguishable, within the system
itself, from true instantaneous images. By applying the emulators of [ABD]
to the constructions presented in this paper, implementations of atomic
snapshot memory are obtained in message-passing systems. Snapshots obtained
this way are true instantaneous images of the global state. In addition,
these implementations are resilient to process and link failures, as long as
a majority of the system remains connected.

Anderson [A89a, An90] has independently obtained bounded implementations of
single-writer atomic snapshots. Memory operations in Anderson's
implementation of the single-writer snapshot memory perform $O(2^n)$ reads
and writes to atomic single-writer multi-reader registers in the worst case.

Anderson originally posed the multi-writer snapshot problem, and uses
single-writer atomic snapshots to construct multi-writer atomic snapshots
[A89b, An90]. Together with the bounded single-writer algorithm of this
paper,  this  provided  the  first  polynomial  construction  of  a  shared  memory 
object  that  can  be  instantaneously  checkpointed.  The  multi-writer  algo¬ 
rithm  of  this  paper  gives  an  alternative  implementation,  building  instead  on 
multi-writer  atomic  registers.  The  efficiency  of  these  constructions  may  be 
compared  by  considering  two  compound  constructions,  tracing  back  to  oper¬ 
ations  on  single-writer  atomic  registers.  Anderson’s  multi-writer  algorithm, 
based  on  the  bounded  single- writer  algorithm  of  this  paper,  requires  0(n‘‘) 
single-writer  operations  per  update  or  scan  operation  in  the  worst  case.  Our 
multi- writer  algorithm,  based  on  multi- writer  registers,  in  turn  implemented 
from  single-writer  registers,  requires  0(n^)  single-writer  operations  per  up¬ 
date  or  scan  operation  in  the  worst  case  (using  the  most  efficient  known 
construction  of  multi-writer  registers  from  single-writer,  due  to  Li,  Tromp 
and  Vitanyi  [LTV89]).  It  is  interesting  to  speculate  whether  other,  more 
efficient  solutions  can  be  found. 

Indeed, an interesting open question is the inherent complexity of
implementing atomic snapshots, in terms of both time and space. In all known
bounded algorithms the scanners write to the updaters; is this necessary?
The scans do a large number of reads; is this also necessary?

Another  question  is  to  find  other  applications  for  atomic  snapshots,  in 
addition  to  the  ones  described. 

The most challenging avenue of research seems to be the relation between
the power of unbounded and bounded wait-free algorithms. Can any primitive
that is not syntactically unbounded¹ be implemented using bounded shared
memory? Specifically, is there a uniform transformation of any unbounded
wait-free solution for some problem into a bounded wait-free solution? Even
a precise definition of this class of problems is not obvious.

Finally, snapshot memories, though apparently more powerful than registers,
nevertheless have bounded wait-free implementations from those simple
primitives. Herlihy showed that many interesting primitives do not have
wait-free implementations from registers [H88]. Is it possible to "close the
gap" further, and construct yet more powerful primitives from registers?
More ambitiously, is it possible to construct a hierarchy of objects
implementable from atomic registers, providing a theoretical basis for the
intuition that snapshot memories are more powerful than single-writer
registers?

¹ Clearly, procedures that return integer or other unbounded values will not
have bounded implementations.